Exploring New Languages with HAIRCUT at CLEF 2005
Abstract
JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in the Bulgarian and Hungarian languages. In our bilingual experiments we used several query languages that are nontraditional for CLEF, such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs used subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and web-based machine translation when no suitable data could be found.
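As a rough illustration of the character n-gram representation mentioned in the abstract, the sketch below splits text into overlapping n-grams that span word boundaries. It is not HAIRCUT's implementation: the function name `char_ngrams`, the default length `n=4`, and the lowercasing and whitespace handling are all assumptions made for this example.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams of `text` (illustrative sketch).

    Assumptions (not taken from the paper): text is lowercased, runs of
    whitespace are collapsed to one space, and a single space is added at
    each end so that word-initial and word-final n-grams are kept.
    """
    normalized = " " + " ".join(text.lower().split()) + " "
    if len(normalized) < n:
        # Text shorter than n: return it as a single (short) token.
        return [normalized]
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    # No stemmer or word list is needed, so the same function can be
    # applied to Bulgarian or Hungarian text unchanged.
    print(char_ngrams("New York"))
    print(char_ngrams("házak"))
```

Because the window slides over the raw character stream, morphological variants of a word share most of their n-grams, which is one plausible reason the approach suits highly inflected languages without language-specific normalization.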
Similar resources
Cross Language Evaluation Forum: CLEF 2005 (Gareth Jones and Carol Peters)
This presentation will report the activities of the CLEF 2005 evaluation campaign. CLEF 2005 consisted of 8 tracks focusing on topics in multilingual information retrieval. An assessment of the results will be given with particular focus on two important tracks: multilingual ad-hoc retrieval and cross-language search in image collections. The multilingual task this year had two objectives: to e...
Ad-hoc Mono- and Bilingual Retrieval Experiments at the University of Hildesheim
This paper reports on our participation in CLEF 2005's ad-hoc multi-lingual retrieval track. The ad-hoc task introduced Bulgarian and Hungarian as new languages. Our experiments focus on the two new languages. Naturally, no relevance assessments are available for these collections yet. Optimization was mainly based on French data from last year. Based on experience from last year, one of our ma...
Combining Passages in the Monolingual Task with the IR-n System
This paper describes our participation in monolingual tasks at CLEF-2005. In this research we have worked in the following languages: English, French, Portuguese, Bulgarian and Hungarian. Our task has been focused on using combined different size passages to improve the Information Retrieval process. Once we have studied the experiments which have been carried out and the official results at CL...
Cross-Language Retrieval Using HAIRCUT for CLEF 2004
JHU/APL continued to explore the use of knowledge-light methods for scalable multilingual retrieval during the CLEF 2004 evaluation. We relied on the language-neutral techniques of character n-gram tokenization, pre-translation query expansion, statistical translation using aligned parallel corpora, fusion from disparate retrievals, and reliance on language similarity when resources are scarce....
Dublin City University at CLEF 2005: Cross-Language Spoken Document Retrieval (CL-SR) Experiments
The Dublin City University participation in the CLEF CL-SR 2005 task concentrated on exploring the application of our existing information retrieval methods based on the Okapi model to the conversational speech data set. This required an approach to determining approximate sentence boundaries within the free-flowing automatic transcription provided. We also performed exploratory experiments on ...